feat(AGX1-274): record task creator identity and FGAC migration safety#246
feat(AGX1-274): record task creator identity and FGAC migration safety#246asherfink wants to merge 2 commits into
Conversation
13fe4b2 to
7486e5a
Compare
7486e5a to
b9cb26b
Compare
…n creation Adds two nullable creator-audit columns to the tasks table — creator_user_id and creator_service_account_id — populated from the principal context at create time. A CHECK constraint (ck_tasks_one_creator) enforces that at most one is set. This replaces the earlier dual-write draft: grants are already issued unconditionally via grant_with_retry in agents_acp_use_case.py:239, and per-account rollout routing belongs in agentex-auth (private), not in this public Apache-2.0 codebase.
ad1e980 to
3a06be8
Compare
| op.execute( | ||
| "ALTER TABLE tasks ADD CONSTRAINT ck_tasks_at_most_one_creator " | ||
| "CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL)) " | ||
| "NOT VALID" | ||
| ) | ||
| op.execute("ALTER TABLE tasks VALIDATE CONSTRAINT ck_tasks_at_most_one_creator") | ||
|
|
||
|
|
||
| def downgrade() -> None: | ||
| op.drop_constraint("ck_tasks_at_most_one_creator", "tasks", type_="check") | ||
| with op.get_context().autocommit_block(): | ||
| op.execute( |
There was a problem hiding this comment.
NOT VALID + VALIDATE in the same Alembic transaction negates the stated safety guarantee
Both ADD CONSTRAINT ... NOT VALID and VALIDATE CONSTRAINT execute inside the same default Alembic transaction. PostgreSQL holds transaction-level locks until the transaction commits, so the ACCESS EXCLUSIVE lock acquired by ADD CONSTRAINT ... NOT VALID is never released until the entire upgrade() function commits — meaning the full-table validation scan runs while ACCESS EXCLUSIVE is still held. On a large tasks table this blocks all concurrent reads and writes for the same wall-clock duration as a plain ADD CONSTRAINT without NOT VALID. The CONCURRENTLY indexes are correctly placed inside autocommit_block (each statement gets its own autocommit transaction), but the constraint pair is not.
To achieve the stated goal, both constraint statements need to run in separate transactions — either by wrapping them together inside their own autocommit_block (each op.execute becomes its own transaction in autocommit mode, so NOT VALID commits with a brief ACCESS EXCLUSIVE and VALIDATE then runs under SHARE UPDATE EXCLUSIVE with concurrent writes allowed), or by splitting into two separate revisions.
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py
Line: 40-51
Comment:
**NOT VALID + VALIDATE in the same Alembic transaction negates the stated safety guarantee**
Both `ADD CONSTRAINT ... NOT VALID` and `VALIDATE CONSTRAINT` execute inside the same default Alembic transaction. PostgreSQL holds transaction-level locks until the transaction commits, so the `ACCESS EXCLUSIVE` lock acquired by `ADD CONSTRAINT ... NOT VALID` is never released until the entire `upgrade()` function commits — meaning the full-table validation scan runs while `ACCESS EXCLUSIVE` is still held. On a large `tasks` table this blocks all concurrent reads and writes for the same wall-clock duration as a plain `ADD CONSTRAINT` without `NOT VALID`. The CONCURRENTLY indexes are correctly placed inside `autocommit_block` (each statement gets its own autocommit transaction), but the constraint pair is not.
To achieve the stated goal, both constraint statements need to run in separate transactions — either by wrapping them together inside their own `autocommit_block` (each `op.execute` becomes its own transaction in autocommit mode, so `NOT VALID` commits with a brief `ACCESS EXCLUSIVE` and `VALIDATE` then runs under `SHARE UPDATE EXCLUSIVE` with concurrent writes allowed), or by splitting into two separate revisions.
How can I resolve this? If you propose a fix, please make it concise.| if self.dual_write_enabled: | ||
| await self._dual_write_with_retry( | ||
| op_name="register", | ||
| do_call=lambda: self.authorization_service.register_resource( | ||
| AgentexResource.task(task_entity.id), | ||
| parent_resource=AgentexResource.agent(agent.id), | ||
| ), | ||
| task_id=task_entity.id, | ||
| ) |
There was a problem hiding this comment.
Register failure propagates to the caller while the task row exists
When _dual_write_with_retry exhausts all retries, AuthenticationServiceUnavailableError propagates out of create_task. The Postgres row is committed and the task exists, but the caller receives a 5xx. An API client following typical retry-on-error behavior will retry the entire create_task call and create a second task row for the same intent — both rows will have audit columns populated but at most one (the second) will be registered in the auth graph. This is documented as acceptable and covered by the AGX1-291 runbook, but it is worth flagging that the orphan-detection approach (scan by creator_user_id) will not distinguish duplicate rows from a single legitimately-orphaned row if the same principal retried.
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/domain/services/task_service.py
Line: 143-151
Comment:
**Register failure propagates to the caller while the task row exists**
When `_dual_write_with_retry` exhausts all retries, `AuthenticationServiceUnavailableError` propagates out of `create_task`. The Postgres row is committed and the task exists, but the caller receives a 5xx. An API client following typical retry-on-error behavior will retry the entire `create_task` call and create a second task row for the same intent — both rows will have audit columns populated but at most one (the second) will be registered in the auth graph. This is documented as acceptable and covered by the AGX1-291 runbook, but it is worth flagging that the orphan-detection approach (scan by `creator_user_id`) will not distinguish duplicate rows from a single legitimately-orphaned row if the same principal retried.
How can I resolve this? If you propose a fix, please make it concise.
Related work
Parent epic: AGX1-264 — per-task FGAC. Follow-ups bundled in AGX1-291.
This change is part of a 5-PR stack across 3 repos. Merge order: scaleapi/scaleapi#144783 (release sgp-authz 0.7.1) → scaleapi/agentex#353 → scaleapi/agentex#356 → this PR → #249.
Action.CANCELcancelopregister_resourceAPI + cancel cleanupTwo commits — keep them separate during review, the audit-column schema change is independent of the dual-write call sites.
Summary
Commit 1 — passive audit columns:
creator_user_id/creator_service_account_idcolumns to thetaskstable, populated from the request principal onAgentTaskService.create_task. Best-effort (NULLable; see caveat below).CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL))to enforce at-most-one creator type at the DB layer (constraint name:ck_tasks_at_most_one_creator).ix_tasks_creator_user_idandix_tasks_creator_service_account_id(CREATE INDEX CONCURRENTLY) for future "tasks created by X" lookups.Commit 2 — FGAC dual-write call sites + flag:
FGAC_TASKS_DUAL_WRITEenv-var flag, injected intoAgentTaskServicevia FastAPI DI. Gates the dual-write behavior end-to-end.create_taskcallsregister_resource(task, parent_resource=agent)on the authorization service after the Postgres row is persisted, so the task is registered withtenant + owner + parent_agenttuples atomically (via scaleapi/agentex#356's new endpoint).delete_taskcallsderegister_resource(task)after the Postgres delete. Pre-resolves the task id by name first so the post-delete deregister doesn't race the lookup._dual_write_with_retry(op_name, do_call, task_id)helper. RetriesAuthenticationServiceUnavailableError/AuthenticationGatewayErrorwith exponential backoff + jitter (3 retries → 4 total attempts max), mirroringAgentsACPUseCase.grant_with_retry. Non-transient exceptions are not retried.task_fgac_dual_write.attempt|success|retry|failure) tagged withop:register|deregisterandexception_class:<name>on failure — these are the rollout signal for AGX1-291's operator runbook.Migration safety
ALTER TABLE ... ADD CONSTRAINT ... NOT VALID+ALTER TABLE ... VALIDATE CONSTRAINT— splits the operation so the briefACCESS EXCLUSIVElock doesn't have to wait on an existence scan.tasksis high-write; a CHECK addition withoutNOT VALIDwould queue behind in-flight transactions and block readers until released.CONCURRENTLYin anautocommit_block.a1f73ada66c5(add_task_creator_columns).down_revisionis6c942325c828(adding_task_cleaned_at, the current alembic head on main);migration_history.txtregenerated viaalembic history. The ORM-sideCheckConstraintinorm.pymatches the DB-side (same constraint name + predicate).Rollout
register_resourceandderegister_resourcefire on create/delete. If they fail after retries, the Postgres row is still the durable record — orphan auth tuples can be cleaned up out of band per the AGX1-291 operator runbook using the creator-audit columns to identify them.Audit-trail caveat
Creator attribution is best-effort: tasks created outside an HTTP request context (Temporal activities, background workers, any path that constructs
AgentTaskServicewithoutrequest.state.principal_context) leave both columns NULL. The CHECK constraint allows both-NULL, andtest_no_resolvable_creator_leaves_both_columns_nullexercises this path.What changed
database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py(new): NOT VALID-pattern migration.down_revision = "6c942325c828".src/adapters/orm.py: declarativeCheckConstraintmirroring the DB constraint.src/domain/entities/tasks.py: new optional fields onTaskEntity.src/domain/services/task_service.py:_principal_fieldhelper (handles dict-vs-pydantic principal shape from the authn proxy).create_taskreadscreator_user_id/creator_service_account_idfrom principal context.AgentTaskService.__init__takesdual_write_enabled: DEnvironmentVariable(EnvVarKeys.FGAC_TASKS_DUAL_WRITE)._dual_write_with_retry(op_name, do_call, task_id)keyed by op name; reused from both call sites.src/adapters/authorization/adapter_agentex_authz_proxy.py: forwards to agentex-auth's/v1/authz/registerand/deregister.src/config/environment_variables.py: newFGAC_TASKS_DUAL_WRITEkey.test_task_audit_columns.py— testcontainers Postgres integration tests for the audit columns (creator population, mutual-exclusion CHECK, both-NULL allowed).test_task_fgac_dual_write.py— covers register-on-create, deregister-on-delete, flag-off skip, transient retry-and-succeed (both register and deregister sides), retry exhaustion propagating with the Postgres row preserved, and the name-route ItemDoesNotExist swallow.dual_write_enabledconstructor parameter.Test plan
migration_lint.py— clean.test_task_audit_columns.py— 7/7 pass locally via testcontainers.test_task_fgac_dual_write.py— collects cleanly; runs in CI integration suite.\d tasksshows new columns + constraint + indexes; flip flag on for one account, confirmtask_fgac_dual_write.successfires.Greptile Summary
This PR adds two features: nullable
creator_user_id/creator_service_account_idaudit columns to thetaskstable (populated from the request principal on task creation), and a flag-gated FGAC dual-write path that registers/deregisters tasks in the agentex-auth authorization graph on create/delete. The dual-write is off by default for safe incremental rollout.creator_user_id+creator_service_account_idcolumns, aCHECKconstraint enforcing at-most-one creator type, and concurrent indexes — but theNOT VALID+VALIDATE CONSTRAINTpair runs inside the same Alembic transaction, negating the stated migration-safety benefit (theACCESS EXCLUSIVElock is held for the full validation scan, not just the constraint-addition step).AgentTaskServiceto callregister_resourceaftertask_repository.createandderegister_resourceaftertask_repository.delete, with an exponential-backoff retry helper (3 retries, mirroringgrant_with_retry) and Datadog counters for rollout observability. Test coverage is thorough across both audit-column population and dual-write behavior.Confidence Score: 3/5
Safe to merge the application-layer changes (dual-write logic, audit-column population, new env flag, tests), but the migration needs correction before running against production — it will block all reads and writes on the tasks table for the full duration of the CHECK constraint validation scan.
The dual-write service code, retry helper, ORM changes, and test suite are well-structured and low-risk. The migration is the blocker: the NOT VALID + VALIDATE constraint pair runs inside a single Alembic transaction, so the ACCESS EXCLUSIVE lock acquired by ADD CONSTRAINT is held through the entire table scan. On a large, high-write tasks table this is the production-impact scenario the migration comment explicitly set out to avoid. The fix (wrapping both constraint statements in their own autocommit_block, or splitting into two revisions) is straightforward but must land before the migration is applied in any environment with meaningful data volume.
agentex/database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py — the constraint locking pattern needs to be corrected before this migration runs in production.
Important Files Changed
Sequence Diagram
sequenceDiagram participant Client participant AgentTaskService participant TaskRepository participant AuthorizationService participant AgentexAuthProxy Client->>AgentTaskService: create_task(agent, task_name, ...) AgentTaskService->>AuthorizationService: principal_context (read) AgentTaskService->>TaskRepository: create(task + creator_user_id/creator_service_account_id) TaskRepository-->>AgentTaskService: TaskEntity alt dual_write_enabled AgentTaskService->>AgentTaskService: _dual_write_with_retry(register) loop retry up to 3x on transient error AgentTaskService->>AuthorizationService: "register_resource(task, parent=agent)" AuthorizationService->>AgentexAuthProxy: POST /v1/authz/register end end AgentTaskService-->>Client: TaskEntity Client->>AgentTaskService: delete_task(id/name) alt dual_write_enabled AND name-only lookup AgentTaskService->>TaskRepository: "get(name=name)" TaskRepository-->>AgentTaskService: task_id end AgentTaskService->>TaskRepository: delete(id/name) alt dual_write_enabled AND task_id resolved AgentTaskService->>AgentTaskService: _dual_write_with_retry(deregister) AgentTaskService->>AuthorizationService: deregister_resource(task) AuthorizationService->>AgentexAuthProxy: POST /v1/authz/deregister end AgentTaskService-->>Client: NonePrompt To Fix All With AI
Reviews (1): Last reviewed commit: "feat(AGX1-274): task FGAC dual-write cal..." | Re-trigger Greptile